Customer churn is a problem every company needs to monitor, especially those that depend on subscription-based revenue. Churn occurs when a customer ends their relationship with a company, and it is a costly problem: customers are the fuel that powers a business, so losing them hurts sales directly. Moreover, acquiring a new customer is far more difficult and expensive than retaining an existing one. Organizations therefore need to focus on reducing customer churn.
The dataset used for this Keras tutorial is the IBM Watson Telco Customer Churn dataset. According to IBM, the business challenge is:
“A telecommunications company [Telco] is concerned about the number of customers leaving their landline business for cable competitors. They need to understand who is leaving. Imagine that you’re an analyst at this company and you have to find out who is leaving and why.”
We are going to use the Keras library to develop a deep learning churn model in Python. We walk you through the preprocessing steps, spending time on how to format the data for Keras. Finally, we show you how to get insights out of the black-box neural network.
import pandas as pd
import numpy as np
The dataset includes customer demographics, account information, the services each customer signed up for, and whether they churned.
df = pd.read_csv("../../data/Telco-Customer-Churn.csv")
df
| customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7038 | 6840-RESVB | Male | 0 | Yes | Yes | 24 | Yes | Yes | DSL | Yes | ... | Yes | Yes | Yes | Yes | One year | Yes | Mailed check | 84.80 | 1990.5 | No |
| 7039 | 2234-XADUH | Female | 0 | Yes | Yes | 72 | Yes | Yes | Fiber optic | No | ... | Yes | No | Yes | Yes | One year | Yes | Credit card (automatic) | 103.20 | 7362.9 | No |
| 7040 | 4801-JZAZL | Female | 0 | Yes | Yes | 11 | No | No phone service | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.60 | 346.45 | No |
| 7041 | 8361-LTMKD | Male | 1 | Yes | No | 4 | Yes | Yes | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 74.40 | 306.6 | Yes |
| 7042 | 3186-AJIEK | Male | 0 | No | No | 66 | Yes | No | Fiber optic | Yes | ... | Yes | Yes | Yes | Yes | Two year | Yes | Bank transfer (automatic) | 105.65 | 6844.5 | No |
7043 rows × 21 columns
# It appears that some values in "TotalCharges" are " " (a blank string) instead of missing. Let's remove those rows.
df = df[df["TotalCharges"] != " "]
df = df.reset_index(drop=True)
df
| customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7027 | 6840-RESVB | Male | 0 | Yes | Yes | 24 | Yes | Yes | DSL | Yes | ... | Yes | Yes | Yes | Yes | One year | Yes | Mailed check | 84.80 | 1990.5 | No |
| 7028 | 2234-XADUH | Female | 0 | Yes | Yes | 72 | Yes | Yes | Fiber optic | No | ... | Yes | No | Yes | Yes | One year | Yes | Credit card (automatic) | 103.20 | 7362.9 | No |
| 7029 | 4801-JZAZL | Female | 0 | Yes | Yes | 11 | No | No phone service | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.60 | 346.45 | No |
| 7030 | 8361-LTMKD | Male | 1 | Yes | No | 4 | Yes | Yes | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 74.40 | 306.6 | Yes |
| 7031 | 3186-AJIEK | Male | 0 | No | No | 66 | Yes | No | Fiber optic | Yes | ... | Yes | Yes | Yes | Yes | Two year | Yes | Bank transfer (automatic) | 105.65 | 6844.5 | No |
7032 rows × 21 columns
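As a side note, an equivalent way to handle those blank strings (sketched on a toy frame with hypothetical values) is to coerce the column to numeric, which turns anything unparseable into NaN, and then drop the NaNs:

```python
import pandas as pd

# Toy frame mimicking the blank-string issue in TotalCharges (made-up values)
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "108.15"]})

# Coerce to numeric: unparseable entries (the blank strings) become NaN
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
toy = toy.dropna(subset=["TotalCharges"]).reset_index(drop=True)
print(len(toy))  # 2 rows survive
```

This also converts the column to float in the same step, which we otherwise do separately below.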
df["TotalCharges"].dtype
dtype('O')
# Convert columns to more suitable dtypes (for convenience, and to save space and time)
df["TotalCharges"] = df["TotalCharges"].astype(float)
df["SeniorCitizen"] = df["SeniorCitizen"].astype("category")
for col in df.columns:
    if df[col].dtype == "object":
        df[col] = df[col].astype("category")
df["tenure"] = df["tenure"].astype(int)
df
| customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.50 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7027 | 6840-RESVB | Male | 0 | Yes | Yes | 24 | Yes | Yes | DSL | Yes | ... | Yes | Yes | Yes | Yes | One year | Yes | Mailed check | 84.80 | 1990.50 | No |
| 7028 | 2234-XADUH | Female | 0 | Yes | Yes | 72 | Yes | Yes | Fiber optic | No | ... | Yes | No | Yes | Yes | One year | Yes | Credit card (automatic) | 103.20 | 7362.90 | No |
| 7029 | 4801-JZAZL | Female | 0 | Yes | Yes | 11 | No | No phone service | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.60 | 346.45 | No |
| 7030 | 8361-LTMKD | Male | 1 | Yes | No | 4 | Yes | Yes | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 74.40 | 306.60 | Yes |
| 7031 | 3186-AJIEK | Male | 0 | No | No | 66 | Yes | No | Fiber optic | Yes | ... | Yes | Yes | Yes | Yes | Two year | Yes | Bank transfer (automatic) | 105.65 | 6844.50 | No |
7032 rows × 21 columns
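The space savings from the category dtype are easy to demonstrate. As a rough sketch with a made-up column: a low-cardinality string column stored as object keeps one Python string per row, while category stores small integer codes plus a single copy of each label:

```python
import pandas as pd

# Illustrative only: compare memory for the same data as object vs. category
s_obj = pd.Series(["Yes", "No"] * 5000)   # object dtype: one string per row
s_cat = s_obj.astype("category")          # category dtype: integer codes + 2 labels

obj_bytes = s_obj.memory_usage(deep=True)
cat_bytes = s_cat.memory_usage(deep=True)
print(obj_bytes > cat_bytes)  # True: the category version is far smaller
```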
df.dtypes
customerID          category
gender              category
SeniorCitizen       category
Partner             category
Dependents          category
tenure                 int32
PhoneService        category
MultipleLines       category
InternetService     category
OnlineSecurity      category
OnlineBackup        category
DeviceProtection    category
TechSupport         category
StreamingTV         category
StreamingMovies     category
Contract            category
PaperlessBilling    category
PaymentMethod       category
MonthlyCharges       float64
TotalCharges         float64
Churn               category
dtype: object
df = df.drop(columns = ["customerID"]) # Drop the customer ID column
df = df.dropna() # Drop any row that has at least 1 NaN
df = df.reindex(columns=['Churn'] + list(df.columns.drop('Churn'))) # Bring the churn in front
df
| Churn | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | No | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 |
| 1 | No | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.50 |
| 2 | Yes | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 |
| 3 | No | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | No | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 |
| 4 | Yes | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7027 | No | Male | 0 | Yes | Yes | 24 | Yes | Yes | DSL | Yes | No | Yes | Yes | Yes | Yes | One year | Yes | Mailed check | 84.80 | 1990.50 |
| 7028 | No | Female | 0 | Yes | Yes | 72 | Yes | Yes | Fiber optic | No | Yes | Yes | No | Yes | Yes | One year | Yes | Credit card (automatic) | 103.20 | 7362.90 |
| 7029 | No | Female | 0 | Yes | Yes | 11 | No | No phone service | DSL | Yes | No | No | No | No | No | Month-to-month | Yes | Electronic check | 29.60 | 346.45 |
| 7030 | Yes | Male | 1 | Yes | No | 4 | Yes | Yes | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Mailed check | 74.40 | 306.60 |
| 7031 | No | Male | 0 | No | No | 66 | Yes | No | Fiber optic | Yes | No | Yes | Yes | Yes | Yes | Two year | Yes | Bank transfer (automatic) | 105.65 | 6844.50 |
7032 rows × 20 columns
# Select the categorical columns to one-hot encode, excluding the target.
categorical_columns = [col for col in df.columns if df[col].dtype.name == "category"]  # all categorical columns
categorical_columns = [col for col in categorical_columns if col != "Churn"]  # exclude the Churn target
categorical_columns
['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod']
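These columns will be one-hot encoded with pd.get_dummies during preprocessing below. A toy example of what drop_first=True does: a column with k levels produces k-1 indicator columns, dropping the redundant (perfectly collinear) baseline level:

```python
import pandas as pd

# Toy column with 3 levels: drop_first=True keeps 3 - 1 = 2 dummy columns
toy = pd.DataFrame({"Contract": ["Month-to-month", "One year", "Two year", "One year"]})
dummies = pd.get_dummies(toy, columns=["Contract"], drop_first=True)
print(list(dummies.columns))  # the first level, "Month-to-month", becomes the baseline
```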
from sklearn.model_selection import train_test_split
train_data_raw, test_data_raw = train_test_split(df, test_size=0.2, random_state = 10)
print(f"Train set has: {len(train_data_raw)} rows")
print(f"Test set has: {len(test_data_raw)} rows")
Train set has: 5625 rows
Test set has: 1407 rows
train_labels = train_data_raw["Churn"].map({"Yes": 1, "No": 0}) # Map "Yes"/"No" to 1/0
test_labels = test_data_raw["Churn"].map({"Yes": 1, "No": 0})
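One optional refinement, sketched below with toy data: churned and non-churned customers are imbalanced, so passing stratify to train_test_split keeps the class ratio the same in both splits:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 80 "No" vs. 20 "Yes"
y = pd.Series(["No"] * 80 + ["Yes"] * 20)
X = pd.DataFrame({"x": range(100)})

# stratify=y preserves the 20% "Yes" rate in both the train and test splits
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=10, stratify=y)
print((y_te == "Yes").mean())  # 0.2, matching the overall rate
```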
from sklearn.preprocessing import StandardScaler
def pre_process(df):
    # Discretize tenure into 6 bins (note: right=False is the default in R, right=True is the default in pandas)
    df['tenure'] = pd.cut(df['tenure'], bins=6, labels=False, right=False)
    df['TotalCharges'] = np.log(df['TotalCharges'])  # put TotalCharges on a log scale
    df = pd.get_dummies(data=df, columns=categorical_columns + ["tenure"], drop_first=True)  # one-hot encode
    scaler = StandardScaler()
    df = scaler.fit_transform(df.drop(columns=["Churn"]))  # standardize features
    return df
train_data = pre_process(train_data_raw.copy()) # .copy() gets rid of warnings but not essential in new versions of pandas
test_data = pre_process(test_data_raw.copy())
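Note that pre_process fits a fresh StandardScaler on each split, so the test set is standardized with its own statistics. A common alternative, sketched here with hypothetical arrays, is to fit the scaler on the training data only and reuse those statistics at test time, which avoids any information leaking from the test set:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrices standing in for the encoded train/test frames
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

scaler = StandardScaler().fit(X_train)   # learn mean/std from training data only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)      # reuse the same statistics at test time
print(X_train_s.shape, X_test_s.shape)
```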
# Number of features
train_data.shape
(5625, 34)
Finally, Deep Learning with Keras in Python!
The first step is to initialize a sequential model, which is the beginning of our Keras model. The sequential model is composed of a linear stack (sequence) of layers.
Note: The first layer needs to know the input shape, i.e. the number of features being fed in. In this case it is the number of columns in the preprocessed data.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import initializers
import tensorflow_addons as tfa # for F1 score
C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_addons\utils\tfa_eol_msg.py:23: UserWarning: TensorFlow Addons (TFA) has ended development and introduction of new features. TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024. Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP). For more information see: https://github.com/tensorflow/addons/issues/2807 warnings.warn(
model = keras.Sequential(
[
layers.Dense(16, activation="relu", kernel_initializer="uniform", input_dim = train_data.shape[1]),
layers.Dropout(0.1),
layers.Dense(16, activation="relu", kernel_initializer="uniform"),
layers.Dropout(0.1),
layers.Dense(1, kernel_initializer="uniform", activation = "sigmoid")
]
)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics =['accuracy', tf.keras.metrics.AUC(),
keras.metrics.Precision(), keras.metrics.Recall(), tfa.metrics.F1Score(num_classes = 1, threshold = 0.5)])
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 16) 560
dropout (Dropout) (None, 16) 0
dense_1 (Dense) (None, 16) 272
dropout_1 (Dropout) (None, 16) 0
dense_2 (Dense) (None, 1) 17
=================================================================
Total params: 849
Trainable params: 849
Non-trainable params: 0
_________________________________________________________________
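The Param # column can be verified by hand: a Dense layer with n_in inputs and n_out units has n_in*n_out weights plus n_out biases, and Dropout layers add no parameters:

```python
# Verify the Param # column: Dense params = inputs * units + units (biases)
n_features = 34                        # train_data.shape[1] from the preprocessing step

dense = n_features * 16 + 16           # 560
dense_1 = 16 * 16 + 16                 # 272
dense_2 = 16 * 1 + 1                   # 17 (dropout layers contribute nothing)
total = dense + dense_1 + dense_2
print(dense, dense_1, dense_2, total)  # 560 272 17 849
```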
# validation_split: hold out 30% of the training data for validation, so we can monitor for overfitting during training.
results = model.fit(x = train_data, y = train_labels, batch_size = 50, epochs = 35, validation_split = 0.30)
Epoch 1/35
79/79 [==============================] - 2s 7ms/step - loss: 0.6503 - accuracy: 0.7229 - auc: 0.5612 - precision: 0.1346 - recall: 0.0066 - f1_score: 0.0127 - val_loss: 0.5325 - val_accuracy: 0.7305 - val_auc: 0.7991 - val_precision: 0.0000e+00 - val_recall: 0.0000e+00 - val_f1_score: 0.0000e+00
Epoch 2/35
79/79 [==============================] - 0s 2ms/step - loss: 0.4720 - accuracy: 0.7475 - auc: 0.8024 - precision: 0.6347 - recall: 0.1320 - f1_score: 0.2186 - val_loss: 0.4517 - val_accuracy: 0.7879 - val_auc: 0.8246 - val_precision: 0.6520 - val_recall: 0.4571 - val_f1_score: 0.5375
...
(epochs 3-34 omitted for brevity: training and validation metrics improve quickly over the first few epochs, then plateau around val_loss ~ 0.42 and val_accuracy ~ 0.80)
...
Epoch 35/35
79/79 [==============================] - 0s 3ms/step - loss: 0.4040 - accuracy: 0.8108 - auc: 0.8589 - precision: 0.6795 - recall: 0.5537 - f1_score: 0.6102 - val_loss: 0.4192 - val_accuracy: 0.7986 - val_auc: 0.8467 - val_precision: 0.6478 - val_recall: 0.5538 - val_f1_score: 0.5972
from matplotlib import pyplot as plt
%matplotlib inline
plt.plot(results.history['accuracy'], '-o') # 'o' is to show the markers, '-' is to draw the line
plt.plot(results.history['val_accuracy'], '-o')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='lower right')
<matplotlib.legend.Legend at 0x1b5f570eaf0>
Validation accuracy is high and tracks training accuracy closely. Once the validation accuracy curve begins to flatten or decrease, it's time to stop training: further epochs only overfit the training data.
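Keras can automate this stopping rule with the keras.callbacks.EarlyStopping callback (e.g. monitor='val_loss' with a patience argument). The underlying patience logic is simple enough to sketch in plain Python, using a hypothetical validation-loss trace:

```python
def epochs_to_keep(val_losses, patience=3):
    """Return how many epochs to train: halt once the validation loss
    has failed to improve for `patience` consecutive epochs."""
    best = float("inf")
    since_best = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            since_best = 0
        else:
            since_best += 1
        if since_best >= patience:
            return epoch
    return len(val_losses)

# Hypothetical trace: loss improves, then plateaus; stop after 3 stagnant epochs
losses = [0.53, 0.45, 0.44, 0.43, 0.435, 0.436, 0.434, 0.437]
print(epochs_to_keep(losses))  # 7
```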
plt.plot(results.history['loss'], '-o')
plt.plot(results.history['val_loss'], '-o')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper right')
<matplotlib.legend.Legend at 0x1b5f58193a0>
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go
### Alternative for interactive plots using plotly - will need to rerun every time jupyter opens
plot_df = pd.DataFrame(results.history)
fig = make_subplots(rows=2, cols=1, shared_xaxes=True)
fig.add_trace(go.Scatter(y=plot_df['loss'], mode='lines+markers', name='training loss'), row=1, col=1)
fig.add_trace(go.Scatter(y=plot_df['val_loss'], mode='lines+markers', name='validation loss'), row=1, col=1)
fig.add_trace(go.Scatter(y=plot_df['accuracy'], mode='lines+markers', name='training accuracy'), row=2, col=1)
fig.add_trace(go.Scatter(y=plot_df['val_accuracy'], mode='lines+markers', name='validation accuracy'), row=2, col=1)
fig.update_xaxes(title_text="epoch", row = 2, col = 1)
fig.update_yaxes(title_text="accuracy", row = 2, col = 1)
fig.update_yaxes(title_text="loss", row = 1, col = 1)
fig
Let’s make some predictions with our Keras model on the test set, which was unseen during training.
# When we did model.compile above, we gave all the performance metrics that we wanted the model to return
test_loss, test_acc, test_auc, test_precision, test_recall, f1_score = model.evaluate(test_data, test_labels)
44/44 [==============================] - 0s 2ms/step - loss: 0.3980 - accuracy: 0.8117 - auc: 0.8554 - precision: 0.6446 - recall: 0.5928 - f1_score: 0.6176
print(f"The loss on the test data is {test_loss}, the accuracy is {test_acc}")
The loss on the test data is 0.39803919196128845, the accuracy is 0.8116559982299805
ROC Area Under the Curve (AUC) measurement
print(f"The AUC is {test_auc}")
The AUC is 0.8554180860519409
Precision answers: when the model predicts “Yes”, how often is it actually “Yes”? Recall (also called the true positive rate) answers: of the actual “Yes” cases, how many did the model catch?
print(f"Precision: {test_precision}, Recall: {test_recall}")
Precision: 0.6445783376693726, Recall: 0.5927977561950684
The F1 score is the harmonic mean of precision and recall:
print(f"F1 Score: {f1_score[0]}")
F1 Score: 0.6176046133041382
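As a sanity check, the harmonic mean can be recomputed from the precision and recall reported above:

```python
# Check: F1 is the harmonic mean of precision and recall
precision = 0.6445783376693726   # test-set precision reported above
recall = 0.5927977561950684      # test-set recall reported above

f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.6176, matching the reported F1 score
```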
predictions_prob = model.predict(test_data)
predictions = np.round(predictions_prob)
44/44 [==============================] - 0s 1ms/step
# Test data and predictions
result_df = pd.DataFrame({"truth": test_data_raw["Churn"].reset_index(drop=True),
"estimate": pd.Series(predictions.flatten()).map({1: "Yes", 0: "No"}),
"class_prob": predictions_prob.flatten()})
result_df
| truth | estimate | class_prob | |
|---|---|---|---|
| 0 | No | No | 0.006033 |
| 1 | No | No | 0.003007 |
| 2 | No | No | 0.153355 |
| 3 | No | No | 0.091527 |
| 4 | No | No | 0.045003 |
| ... | ... | ... | ... |
| 1402 | No | No | 0.079841 |
| 1403 | No | No | 0.059549 |
| 1404 | No | No | 0.006350 |
| 1405 | No | No | 0.157536 |
| 1406 | No | No | 0.172290 |
1407 rows × 3 columns
# Show 10 random samples, since the rows displayed above happened to be all 'No'
result_df.sample(10)
| truth | estimate | class_prob | |
|---|---|---|---|
| 1326 | No | No | 0.014273 |
| 706 | No | No | 0.215997 |
| 53 | Yes | No | 0.116361 |
| 353 | No | No | 0.097059 |
| 1149 | Yes | Yes | 0.710942 |
| 121 | Yes | No | 0.282029 |
| 478 | Yes | Yes | 0.757679 |
| 814 | No | No | 0.112788 |
| 47 | No | No | 0.026224 |
| 156 | No | No | 0.331940 |
from sklearn.metrics import confusion_matrix
conf_matrix = confusion_matrix(result_df["truth"], result_df["estimate"], normalize='pred')  # normalize='pred': rates per predicted class (columns sum to 1)
conf_matrix
array([[0.86325581, 0.35542169],
[0.13674419, 0.64457831]])
conf_matrix = confusion_matrix(result_df["truth"], result_df["estimate"])
conf_matrix
array([[928, 118],
[147, 214]], dtype=int64)
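As a cross-check, precision, recall, and accuracy can all be recovered from these raw counts, and they match the values reported by model.evaluate:

```python
import numpy as np

# Confusion matrix from above: rows = truth (No, Yes), columns = prediction (No, Yes)
conf = np.array([[928, 118],
                 [147, 214]])

tn, fp, fn, tp = conf.ravel()
precision = tp / (tp + fp)         # 214 / 332
recall = tp / (tp + fn)            # 214 / 361
accuracy = (tp + tn) / conf.sum()  # 1142 / 1407
print(round(precision, 4), round(recall, 4), round(accuracy, 4))  # 0.6446 0.5928 0.8117
```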
from sklearn.metrics import ConfusionMatrixDisplay
disp = ConfusionMatrixDisplay(conf_matrix, display_labels=["No", "Yes"])
disp.plot()
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x1b5f05e4f70>